By: Kris Ghimire, Thad Schwebke, Walter Lai, and Jamie Vo
Photo credit: kat wilcox, from Pexels
# Load in libraries
# general libraries
import pandas as pd
import numpy as np
import os
# hide warnings
import warnings
warnings.filterwarnings('ignore')
# visualizations libraries
import seaborn as sns
import plotly
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
%matplotlib inline
# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.feature_selection import RFE, SelectKBest, chi2
import statsmodels.api as sm
import random
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).
The Murder Accountability Project is a nonprofit organization that tracks discrepancies between homicides reported by medical examiners and those in the FBI's voluntary crime reports. The database is considered one of the most exhaustive collections of homicide records currently available for the US. Additional information about the organization can be found at the Murder Accountability Project.
The dataset dates back to 1967 and includes demographic information such as gender, age, and ethnicity. A more in-depth description of the attributes may be found in the Data Description section.
# read in the data
df = pd.read_csv('../Data/database.csv')
# print the number of records and columns
records = len(df)
attributes = df.columns
print(f'No. of Records: {records} \nNo. of Attributes: {len(attributes)}')
Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset?
How would you measure the effectiveness of a good prediction algorithm? Be specific.
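If the modeling target is, for example, whether a case was solved, effectiveness can be measured with standard classification metrics. A minimal sketch on hypothetical labels (the label vectors below are made up for illustration, not drawn from the dataset):

```python
# Classification metrics on hypothetical predicted vs. actual labels
# (1 = solved, 0 = unsolved).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model output

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1:        {f1_score(y_true, y_pred):.2f}")
```

Because unsolved cases are a minority class, accuracy alone can be misleading; precision, recall, and F1 give a fuller picture.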
df_description = pd.read_excel('../Data/data_description.xlsx')
pd.set_option('display.max_colwidth', 0)
df_description
Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.
Missing Values
Duplicates
Outliers
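Missingness in this dataset is largely encoded as the literal string 'Unknown' rather than as NaN, so a plain `isna()` undercounts it. A minimal sketch on a toy frame (column names mirror the dataset; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the homicide data: missingness appears both as
# NaN and as the literal string 'Unknown'.
toy = pd.DataFrame({
    'Perpetrator Sex': ['Male', 'Unknown', 'Female', 'Unknown'],
    'Victim Age': [34, np.nan, 21, 45],
})

nan_counts = toy.isna().sum()          # catches only true NaN
unknown_counts = (toy == 'Unknown').sum()  # catches the placeholder string
print(nan_counts)
print(unknown_counts)
```

Both counts need to be checked per column before deciding how to treat missing data.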
# count exact duplicate rows by grouping on every column
df_duplicates = df.groupby(df.columns.tolist(), as_index=False).size()
df_duplicates.loc[df_duplicates['size'] > 1]
Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.
# basic statistics of categorical data
df_categorical = df.select_dtypes(include='object')
df_categorical.describe()
# get all levels per categorical attribute
df_categorical_levels = pd.DataFrame()
df_categorical_levels['Attribute'] = df_categorical.columns
df_categorical_levels['Levels'] = ''
df_categorical_levels['Levels_Count'] = ''
df_categorical_levels['Unknown_Count'] = ''
# populate the dataframe with categorical levels and count of each category
for i, row in df_categorical_levels.iterrows():
    attribute = row['Attribute']
    df_categorical_levels.at[i, 'Levels'] = df[attribute].unique()
    df_categorical_levels.at[i, 'Levels_Count'] = df[attribute].nunique(dropna=False)
    # count rows holding the 'Unknown' placeholder (0 when the level is absent)
    df_categorical_levels.at[i, 'Unknown_Count'] = (df[attribute] == 'Unknown').sum()
# show the dataframe
df_categorical_levels.sort_values(by='Unknown_Count', ascending = False)
Attributes with the greatest amount of missing data are ethnicity, relationship, and perpetrator race/sex.
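One way to handle the placeholder is to map 'Unknown' to NaN so that standard missing-data tooling (`isna()`, `dropna()`, imputation) can see it. A minimal sketch on a toy frame (column names mirror the dataset; the values are made up):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'Victim Ethnicity': ['Unknown', 'Hispanic', 'Unknown'],
    'Relationship': ['Wife', 'Unknown', 'Stranger'],
})

# Map the 'Unknown' placeholder to NaN so pandas treats it as missing.
cleaned = toy.replace('Unknown', np.nan)
print(cleaned.isna().sum())
```

Whether to drop, impute, or keep 'Unknown' as its own level depends on the attribute: for perpetrator demographics, 'Unknown' may itself be informative (it often coincides with unsolved cases).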
# basic statistics for continuous variables
df.describe()
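The summary above also surfaces implausible extremes; Victim Age, for instance, appears to use large placeholder values for unknown ages, which show up as the column maximum. A minimal IQR-based sketch for flagging such values, on made-up ages (998 stands in for the assumed placeholder):

```python
import pandas as pd

# Hypothetical ages including a 998 placeholder value.
ages = pd.Series([23, 31, 45, 19, 67, 998])

# Flag values beyond Q3 + 1.5 * IQR as outliers.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = ages[ages > upper]
print(outliers)
```

Flagged values should then be inspected individually: a 998 is a coding artifact to be treated as missing, whereas a genuine high age is real data to keep.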
Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.
# pairwise scatter matrix of the numeric attributes
fig = px.scatter_matrix(df[['Year', 'Incident', 'Victim Age', 'Victim Count', 'Perpetrator Count']])
fig.show()
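For a single categorical attribute, a bar chart of level counts is usually the more appropriate choice than a scatter matrix. A minimal sketch on toy data (the 'Weapon' column name matches the dataset; the counts are made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical weapon records; the real data's 'Weapon' column has
# levels such as Handgun, Knife, and Blunt Object.
toy = pd.DataFrame({'Weapon': ['Handgun', 'Handgun', 'Knife',
                               'Blunt Object', 'Handgun']})

weapon_counts = toy['Weapon'].value_counts()
weapon_counts.plot(kind='bar', title='Homicides by weapon (toy data)')
plt.tight_layout()
```

Bar height maps directly to frequency, which makes the dominant levels easy to compare at a glance.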